Background

Why would a biologist learn to code?

In 2009, the American Association for the Advancement of Science and the National Science Foundation released a call-to-action report, Vision and Change, which recommended major changes in undergraduate biology education to reflect how advances in biological science occur in the 21st century. The authors of the report note: “To contribute effectively to this ‘New Biology’, scientists need to interact with information in new ways, including being able to manage large, complex data sets. Systems approaches and biological modeling rely on the application of mathematics and statistical analysis, while the explosive generation of larger and larger data sets demands increasingly sophisticated computational knowledge.” (Bray et al. 2016).

A fundamental element of workflows in ecology and evolution is the analysis of data. Most ecologists now commonly write code as part of their laboratory, field, or modeling research. The transition to a greater reliance on code has been driven by increases in the quantity and types of data used in ecological studies, alongside improvements in computing power and software. Code is written in programming languages such as R and Python, and is used by ecologists, evolutionary biologists, and bioinformaticians for a wide variety of tasks including manipulating, analyzing, and graphing data. A benefit of this transition to code-based analyses is that code provides a precise record of what has been done, making it easy to reproduce, adapt, and expand existing analyses.

Why learn R?

The name “R” refers to the computational environment initially created by Ross Ihaka and Robert Gentleman, similar in nature to the “S” statistical environment developed at AT&T Bell Laboratories. It has since been developed and maintained by a strong team of core developers (R-core), who are renowned researchers in computational disciplines. R has gained wide acceptance as a reliable and powerful modern computational environment for statistical computing and visualisation, and is now used in many areas of scientific computation. Why bother learning it?

  1. R is an in-demand technical skill – R is the single most important data analysis tool in ecology and evolution, with over 60% of 60,000 articles in these domains reporting its use by 2017 (Lai, Lortie, Muenchen, Yang, & Ma, 2019). R is also one of the most sought-after data science skills in industry.

  2. R is free and open source – R is free software, released under the GNU General Public License; this means anyone can see all its source code to see how R works, and there are no restrictive, costly licensing arrangements. Because of this transparency, there is less chance for mistakes, and if you (or someone else) find some, you can report and fix bugs.

  3. R is interdisciplinary and extensible – With 10,000+ packages that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more. Plus R is extensible, which means that procedures for analyzing or visualizing data that do not currently exist can (and probably will) be readily developed.

  4. R works on data of all shapes and sizes – The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won’t make much difference to you. R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient. R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.

  5. R has a large and welcoming community – Thousands of people use and extend R daily. Many of them are willing to help you through mailing lists and websites such as Stack Overflow, or on the RStudio community.

  6. R produces high-quality graphics and interactive web-based content – The plotting and web development functionalities in R are extensive, and allow you to adjust any aspect of your graphics and visualizations to convey most effectively the message from your data.

  7. R does not involve lots of pointing and clicking, and that’s a good thing – The learning curve might be steeper than with other software, but with R, the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of written commands, and that’s a good thing! So, if you want to redo your analysis because you collected more data, you don’t have to remember which button you clicked in which order to obtain your results; you just have to run your script again.

  8. R is great for reproducibility – Reproducibility is when someone else (including your future self) can obtain the same results from the same dataset when using the same analysis. R integrates with other tools to generate manuscripts from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript are updated automatically. An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.
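
As a small illustration of the point above about missing data and factors, the following sketch uses only base R. The measurements and site labels are made up for the example.

```r
# Hypothetical wing lengths (mm); NA marks a missing measurement
wing_length <- c(10.2, NA, 11.5, 9.8)

# Ask R to skip missing values when averaging
mean_length <- mean(wing_length, na.rm = TRUE)

# A factor stores a categorical variable with a fixed set of levels
site <- factor(c("forest", "meadow", "forest", "meadow"))
levels(site)
```

Without `na.rm = TRUE`, `mean(wing_length)` would return `NA`, which is R's way of saying the answer is unknown because a value is missing.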

The R environment

R is a free software environment for data manipulation, statistics, and graphical display. It lets you load data from almost any kind of file, manipulate, analyze, and visualize it in many different ways, and export the output in almost any file format.

R is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.

What are R and RStudio?

For much of this book, we will assume that you are using R via RStudio. First time users often confuse the two. At its simplest:

  • R is like a car’s engine.
  • RStudio is like a car’s dashboard.
R: Engine RStudio: Dashboard

More precisely, R is a programming language that runs computations, while RStudio is an integrated development environment (IDE) that provides an interface to R and adds many convenient features and tools. Just as having access to a speedometer, rearview mirrors, and a navigation system makes driving much easier, using RStudio’s interface makes using R much easier.

RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.

If you want to find out more about the difference between R and the RStudio IDE, this DataCamp video might be helpful.

Installing R and RStudio

You will first need to download and install both R and RStudio (Desktop version) on your computer.

  1. You must do this first: Download and install R.
    • Click on the download link corresponding to your computer’s operating system.
  2. You must do this second: Download and install RStudio.
    • Scroll down to “Installers for Supported Platforms”
    • Click on the download link corresponding to your computer’s operating system.

If you had trouble with these two steps, we suggest you watch this DataCamp video.

Knowing your way around RStudio

Let’s start by learning about RStudio, which is an Integrated Development Environment (IDE) for working with R.

The RStudio IDE open-source product is free under the Affero General Public License (AGPL) v3. The RStudio IDE is also available with a commercial license and priority email support from RStudio, Inc.

We will use RStudio IDE to write code, navigate the files on our computer, inspect the variables we are going to create, and visualize the plots we will generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.

RStudio interface screenshot. Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.


RStudio is divided into 4 “Panes”: the Source for your scripts and documents (top-left, in the default layout), your Environment/History (top-right), your Files/Plots/Packages/Help/Viewer (bottom-right), and the R Console (bottom-left). The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout).

One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.

Interacting with R

R Console

The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or code, instructions in R because it is a common language that both the computer and we can understand. We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.

There are two main ways of interacting with R: by using the console or by using script files (plain text files that contain your code). The console pane (in RStudio, the bottom left panel) is the place where commands written in the R language can be typed and executed immediately by the computer. It is also where the results will be shown for commands that have been executed. You can type commands directly into the console and press Enter to execute those commands, but they will be forgotten when you close the session.

If R is ready to accept commands, the R console shows a > prompt. If it receives a command (by typing, copy-pasting or sent from the script editor using Ctrl + Enter), R will try to execute it, and when ready, will show the results and come back with a new > prompt to wait for new commands.

If R is still waiting for more input because the command you entered is not yet complete, the console will show a + prompt. This usually happens because you have not ‘closed’ a parenthesis or quotation, i.e. you don’t have the same number of left-parentheses as right-parentheses, or the same number of opening and closing quotation marks. When this happens, and you thought you had finished typing your command, click inside the console window and press Esc; this will cancel the incomplete command and return you to the > prompt.

Coding basics

Let’s review some basics we’ve so far omitted in the interests of getting you plotting as quickly as possible. You can use R as a calculator:
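
For instance, typing expressions like these at the console prints their results (the numbers are arbitrary, chosen only for illustration):

```r
1 / 200 * 30       # division and multiplication
(59 + 73 + 2) / 3  # parentheses control the order of operations
sqrt(16)           # built-in mathematical functions work too
```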

You can create new objects with <-:
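
A minimal sketch, where the object name `x` is just a placeholder:

```r
x <- 3 * 4  # compute 3 * 4 and store the result under the name x
x           # typing the name on its own prints the stored value
```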

All R statements where you create objects, assignment statements, have the same form:
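
In outline, the pattern is `object_name <- value`. A concrete instance, with an invented name and value, might look like this:

```r
# General form: object_name <- value
plant_height_cm <- 42.5

# The stored value can be reused in later calculations
plant_height_m <- plant_height_cm / 100
```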

When reading that code say “object name gets value” in your head.

You will make lots of assignments and <- is a pain to type. Don’t be lazy and use =: it will work, but it will cause confusion later. Instead, use RStudio’s keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automagically surrounds <- with spaces, which is a good code formatting practice. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.

R Scripts

So far you’ve been using the console to run code. That’s a great place to start. However, because we want our code and workflow to be reproducible, it is better to type the commands we want in the script editor, and save the script. This way, there is a complete record of what we did, and anyone (including our future selves!) can easily replicate the results on their computer.

The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor. RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open. Nevertheless, it’s a good idea to save your scripts regularly and to back them up.

RStudio allows you to execute commands directly from the script editor by using the Ctrl + Enter shortcut (on Macs, Cmd + Return will work, too). The command on the current line in the script (indicated by the cursor) or all of the commands in the currently selected text will be sent to the console and executed when you press Ctrl + Enter. You can find other keyboard shortcuts in this RStudio cheatsheet about the RStudio IDE.
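
As a rough sketch of what a saved script might contain, here is a short example. The file name `my_analysis.R` and the counts are invented for illustration. Placing the cursor on a line and pressing Ctrl + Enter sends that line to the console.

```r
# my_analysis.R (hypothetical file name)
# Lines starting with # are comments: notes for people, ignored by R

# Made-up counts of seedlings in five plots
seedlings <- c(12, 7, 9, 15, 11)

# Summarise the counts
total_seedlings <- sum(seedlings)
mean_seedlings <- mean(seedlings)
```

Because every step is written down, rerunning the whole script after adding a sixth plot would update both summaries without any extra effort.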

Errors, warnings, and messages

One slightly confusing part of R is how it reports errors, warnings, and messages. The default theme in RStudio colors errors, warnings, and messages in red, which makes them seem like you did something wrong. However, seeing red text in the console is not always bad.

R will show red text in the console in three different situations:

  • Errors: When the red text is a legitimate error, it will be prefaced with “Error in…” and try to explain what went wrong. Generally when there’s an error, the code will not run. For example, as shown in the “Package use” subsection below, if you see Error in ggplot(...) : could not find function "ggplot", it means that the ggplot() function is not accessible because the package that contains it was not loaded with library(ggplot2), and thus you cannot use it.
  • Warnings: When the red text is a warning, it will be prefaced with “Warning:” and try to explain why there’s a warning. Generally your code will still work, but with some caveats. For example, as you will see in Chapter @ref(viz), if you create a scatterplot and one of the rows in your data frame is missing a value, you will see this warning: Warning: Removed 1 rows containing missing values (geom_point). R will still make the scatterplot with all the remaining values, but it’s warning you that one of the points isn’t there.
  • Messages: When the red text doesn’t start with either “Error” or “Warning”, it’s just a friendly message. You’ll see these messages when you load some packages like the dplyr package in the “Package loading” subsection below, or when you read data saved in spreadsheet files with read_csv() as you’ll see in Chapter @ref(tidy). These are helpful diagnostic messages and they don’t stop your code from working.

Remember, when you see red text in the console, don’t panic. It doesn’t necessarily mean anything is wrong.

  • If the text starts with “Error”, figure out what’s causing it. Think of errors as a red traffic light: something is wrong!
  • If the text starts with “Warning”, figure out if it’s something to worry about. For instance, if you get a warning about missing values in a scatterplot and you know there are missing values, you’re fine. If that’s surprising, look at your data and see what’s missing. Think of warnings as a yellow traffic light: everything is working fine, but watch out/pay attention.
  • Otherwise the text is just a message. Read it, wave back at R, and thank it for talking to you. Think of messages as a green traffic light: everything is working fine.
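
To see the three kinds of output side by side, the sketch below triggers each one on purpose using base R. The examples are chosen only for illustration. `tryCatch()` is used so that the deliberate error is captured as text rather than stopping the rest of the code.

```r
# Message: purely informational
message("Just letting you know: the analysis has started")

# Warning: the code still runs, but R flags a possible problem.
# as.numeric() cannot convert "abc", so it returns NA with a warning
converted <- as.numeric(c("1", "2", "abc"))

# Error: the offending command does not complete.
# Adding a number to a piece of text is not allowed
error_text <- tryCatch(
  1 + "a",
  error = function(e) conditionMessage(e)
)
error_text
```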

R packages

Another point of confusion with many new R users is the idea of an R package. R packages extend the functionality of R by providing additional functions, data, and documentation. They are written by a world-wide community of R users and can be downloaded for free from the internet. For example, among the many packages we will use in this book are:

  • The ggplot2 package for data visualization in Chapter @ref(viz).
  • The dplyr package for data wrangling in Chapter @ref(wrangling).
  • The moderndive package that accompanies this book.
  • The infer package for “tidy” and transparent statistical inference in Chapters @ref(confidence-intervals), @ref(hypothesis-testing), and @ref(inference-for-regression).

A good analogy for R packages is they are like apps you can download onto a mobile phone:

R: A new phone R Packages: Apps you can download

So R is like a new mobile phone: while it has a certain amount of features when you use it for the first time, it doesn’t have everything. R packages are like the apps you can download onto your phone from Apple’s App Store or Android’s Google Play.

Let’s continue this analogy by considering the Instagram app for editing and sharing pictures. Say you have purchased a new phone and you would like to share a recent photo you have taken on Instagram. You need to:

  1. Install the app: Since your phone is new and does not include the Instagram app, you need to download the app from either the App Store or Google Play. You do this once and you’re set. You might do this again in the future any time there is an update to the app.
  2. Open the app: After you’ve installed Instagram, you need to open the app.

Once Instagram is open on your phone, you can then proceed to share your photo with your friends and family. The process is very similar for using an R package. You need to:

  1. Install the package: This is like installing an app on your phone. Most packages are not installed by default when you install R and RStudio. Thus if you want to use a package for the first time, you need to install it first. Once you’ve installed a package, you likely won’t install it again unless you want to update it to a newer version.
  2. “Load” the package: “Loading” a package is like opening an app on your phone. Packages are not “loaded” by default when you start RStudio on your computer; you need to “load” each package you want to use every time you start RStudio.

Let’s now show you how to perform these two steps for the ggplot2 package for data visualization.

Package installation

There are two ways to install an R package. For example, to install the ggplot2 package:

  1. Easy way: In the Files pane of RStudio:
    1. Click on the “Packages” tab
    2. Click on “Install”
    3. Type the name of the package under “Packages (separate multiple with space or comma):” In this case, type ggplot2
    4. Click “Install”
  2. Slightly harder way: An alternative but slightly less convenient way to install a package is by typing install.packages("ggplot2") in the Console pane of RStudio and hitting enter. Note you must include the quotation marks.
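
Typed out, the console approach looks like the sketch below. The helper `install_if_missing()` is not part of R; it is a hypothetical convenience function, written here for illustration, that skips the download when the package is already present.

```r
# Simplest form: download and install ggplot2 from CRAN (needs internet access)
# install.packages("ggplot2")

# Hypothetical helper: install a package only if it is not yet installed
install_if_missing <- function(pkg) {
  already_installed <- requireNamespace(pkg, quietly = TRUE)
  if (!already_installed) {
    install.packages(pkg)
  }
  invisible(already_installed)
}

# Usage: uncomment to run
# install_if_missing("ggplot2")
```

The install lines are commented out so that nothing is downloaded by accident; remove the leading `#` when you actually want to install.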

Package loading

Recall that after you’ve installed a package, you need to “load” it, in other words open it. We do this by using the library() command. For example, to load the ggplot2 package, run the following code in the Console pane. What do we mean by “run the following code”? Either type or copy & paste the following code into the Console pane and then hit the enter key.
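
The command in question is a single line (this assumes ggplot2 has already been installed as described in the previous subsection):

```r
library(ggplot2)  # load ggplot2 for the current R session
```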

If after running the above code, a blinking cursor returns next to the > “prompt” sign, it means you were successful and the ggplot2 package is now loaded and ready to use. If however, you get a red “error message” that reads…

Error in library(ggplot2) : there is no package called ‘ggplot2’

… it means that you didn’t successfully install it. In that case, go back to the previous subsection “Package installation” and install it.

Package use

One extremely common mistake new R users make when wanting to use particular packages is they forget to “load” them first by using the library() command we just saw. Remember: you have to load each package you want to use every time you start RStudio. If you don’t first “load” a package, but attempt to use one of its features, you’ll see an error message similar to:

Error: could not find function

R is telling you that you are trying to use a function in a package that has not yet been “loaded.” Almost all new users forget to do this when starting out, and it is a little annoying to get used to. However, you’ll remember with practice.
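
The same kind of error can be reproduced safely with a function name that does not exist anywhere. In this sketch `not_a_real_function()` is deliberately made up, and `tryCatch()` captures the error text instead of halting:

```r
error_text <- tryCatch(
  not_a_real_function(),
  error = function(e) conditionMessage(e)
)
error_text  # the message says that the function could not be found
```

When the missing function belongs to a package you have installed, loading that package with `library()` and rerunning the command is usually the fix.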

Tips on learning to code

Learning to code/program is very much like learning a foreign language: it can be very daunting and frustrating at first. Such frustrations are very common and it is very normal to feel discouraged as you learn. However, just as with learning a foreign language, if you put in the effort and are not afraid to make mistakes, anybody can learn.

Here are a few useful tips to keep in mind as you learn to program:

  • Remember that computers are not actually that smart: You may think your computer or smartphone is “smart,” but really people spent a lot of time and energy designing them to appear “smart.” In reality, you have to tell a computer everything it needs to do. Furthermore, the instructions you give your computer can’t have any mistakes in them, nor can they be ambiguous in any way.
  • Take the “copy, paste, and tweak” approach: Especially when learning your first programming language, it is often much easier to take existing code that you know works and modify it to suit your ends, rather than trying to write new code from scratch. We call this the copy, paste, and tweak approach. So early on, we suggest not trying to write code from memory, but rather taking existing examples we have provided you, then copying, pasting, and tweaking them to suit your goals. Don’t be afraid to play around!
  • The best way to learn to code is by doing: Rather than learning to code for its own sake, we feel that learning to code goes much more smoothly when you have a goal in mind or when you are working on a particular project, like analyzing data that you are interested in.
  • Practice is key: Just as the only way to improve your foreign language skills is through practice, practice, and practice, the only way to improve your coding is through practice, practice, and practice. Don’t worry, however; we’ll give you plenty of opportunities to do so!

Seeking help

Use the built-in RStudio help interface to search for more information on R functions

RStudio help interface.


One of the fastest ways to get help is to use the RStudio help interface. By default, this panel can be found in the lower right-hand corner of RStudio. As seen in the screenshot, when you type the word “Mean”, RStudio also offers a number of suggestions that you might be interested in. The description is then shown in the display window.

I know the name of the function I want to use, but I’m not sure how to use it

If you need help with a specific function, let’s say barplot(), you can type:
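
Both of the following forms ask for the help page of `barplot()`; they are equivalent:

```r
?barplot         # shorthand
help("barplot")  # the same request written as a function call
```

In RStudio the page opens in the Help pane.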

If you just need to remind yourself of the names of the arguments, you can use:
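
For example, using `lm()` (the base R function for fitting linear models, picked here only as an illustration):

```r
args(lm)  # prints the argument names of lm() and their default values
```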

I want to use a function that does X, there must be a function for it but I don’t know which one…

If you are looking for a function to do a particular task, you can use the help.search() function, which is called by the double question mark ??. However, this only looks through the installed packages for help pages that match your search request.
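
For example, to look for help pages that mention the Kruskal-Wallis test (the search term is chosen only as an illustration):

```r
??kruskal               # shorthand
help.search("kruskal")  # the same search written as a function call
```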

If you can’t find what you are looking for, you can use the rdocumentation.org website that searches through the help files across all packages available.

Finally, a generic Google or internet search “R <task>” will often either send you to the appropriate package documentation or a helpful forum where someone else has already asked your question.

I am stuck… I get an error message that I don’t understand

Start by googling the error message. However, this doesn’t always work very well because often, package developers rely on the error catching provided by R. You end up with general error messages that might not be very helpful to diagnose a problem (e.g. “subscript out of bounds”). If the message is very generic, you might also include the name of the function or package you’re using in your query.

You should also check Stack Overflow, searching with the [r] tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers: http://stackoverflow.com/questions/tagged/r

The Introduction to R can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language.

The R FAQ is dense and technical but it is full of useful information.